摘要 :
Motivated by the impressive predictive power of simple patterns, we consider the problem of mining frequent subtrees in arbitrary graphs. Although the restriction of the pattern language to trees does not resolve the computational...
展开
Motivated by the impressive predictive power of simple patterns, we consider the problem of mining frequent subtrees in arbitrary graphs. Although the restriction of the pattern language to trees does not resolve the computational complexity of frequent subgraph mining, in a recent work we have shown that it gives rise to an algorithm generating probabilistic frequent subtrees, a random subset of all frequent subtrees, from arbitrary graphs with polynomial delay. It is based on replacing each transaction graph in the input database with a forest formed by a random subset of its spanning trees. This simple technique turned out to be quite powerful on molecule classification tasks. It has, however, the drawback that the number of sampled spanning trees must be bounded by a polynomial of the size of the transaction graphs, resulting in less impressive recall even for slightly more complex structures beyond molecular graphs. To overcome this limitation, in this work we propose an algorithm mining probabilistic frequent subtrees also with polynomial delay, but by replacing each graph with a forest formed by an exponentially large implicit subset of its spanning trees. We demonstrate the superiority of our algorithm over the simple one on threshold graphs used e.g. in spectral clustering. In addition, providing sufficient conditions for the completeness and efficiency of our algorithm, we obtain a positive complexity result on exact frequent subtree mining for a novel, practically and theoretically relevant graph class that is orthogonal to all graph classes defined by some constant bound on monotone graph properties.
收起
摘要 :
Existing algorithms of mining frequent XML query patterns (XQPs)employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most of existing methods co...
展开
Existing algorithms of mining frequent XML query patterns (XQPs)employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most of existing methods compute the frequencies of candidate query patterns from scratch periodically by checking the entire transaction database, which consists of XQPs transferred from user query logs. However, it is not straightforward to maintain such discovered frequent patterns in real XML databases as there may be frequent updates that may not only invalidate some existing frequent query patterns but also generate some new frequent query patterns. Therefore, a drawback of existingmethods is that they are rather inefficient for the evolution of transaction databases. To address above-mentioned problems,this paper proposes an efficient algorithm ESPRIT to mine frequent XQPs without costly tree-containment checking. ESPRIT transforms XML queries into sequences using a one-to-one mapping technique and mines the frequent sequences to generate frequent XQPs. We propose two efficient incremental algorithms, ESPRIT-i and ESPRIT-i+, to incrementally mine frequent XQPs. We devise several novel optimization techniques of query rewriting, cache lookup, and cache replacement to improve the answerability and the hit rate of caching. We have implemented our algorithms and conducted a set of experimental studies on various datasets. The experimental results demonstrate that our algorithms achieve high efficiency and scalability and outperform state-of-the-art methods significantly.
收起
摘要 :
Existing algorithms of mining frequent XML query patterns (XQPs) employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most of existing methods c...
展开
Existing algorithms of mining frequent XML query patterns (XQPs) employ a candidate generate-and-test strategy. They involve expensive candidate enumeration and costly tree-containment checking. Further, most of existing methods compute the frequencies of candidate query patterns from scratch periodically by checking the entire transaction database, which consists of XQPs transferred from user query logs. However, it is not straightforward to maintain such discovered frequent patterns in real XML databases as there may be frequent updates that may not only invalidate some existing frequent query patterns but also generate some new frequent query patterns. Therefore, a drawback of existing methods is that they are rather inefficient for the evolution of transaction databases. To address above-mentioned problems, this paper proposes an efficient algorithm ESPRIT to mine frequent XQPs without costly tree-containment checking. ESPRIT transforms XML queries into sequences using a one-to-one mapping technique and mines the frequent sequences to generate frequent XQPs. We propose two efficient incremental algorithms, ESPRIT-i and ESPRIT-i +, to incrementally mine frequent XQPs. We devise several novel optimization techniques of query rewriting, cache lookup, and cache replacement to improve the answerability and the hit rate of caching. We have implemented our algorithms and conducted a set of experimental studies on various datasets. The experimental results demonstrate that our algorithms achieve high efficiency and scalability and outperform state-of-the-art methods significantly.
收起
摘要 :
Maximal frequent pattern mining has been suggested for data mining to avoid generating a huge set of frequent patterns. Conversely, weighted frequent pattern mining has been proposed to discover important frequent patterns by cons...
展开
Maximal frequent pattern mining has been suggested for data mining to avoid generating a huge set of frequent patterns. Conversely, weighted frequent pattern mining has been proposed to discover important frequent patterns by considering the weighted support. We propose two mining algorithms of maximal correlated weight frequent pattern (MCWP), termed MCWP(WA) (based on Weight Ascending order) and MCWP(SD) (based on Support Descending order), to mine a compact and meaningful set of frequent patterns. MCWP(SD) obtains an advantage in conditional database access, but may not obtain the highest weighted item of the conditional database to mine highly correlated weight frequent patterns. Thus, we suggest a technique that uses additional conditions to prune lowly correlated weight items before the subsets checking process. Analyses show that our algorithms are efficient and scalable.
收起
摘要 :
Top-K Uncertain Frequent Pattern (UFP) mining is an interesting topic in data mining. The existing TUFP algorithm supports static mining of Top-K UFPs; however, in the real world, users need to repeatedly change the K threshold to...
展开
Top-K Uncertain Frequent Pattern (UFP) mining is an interesting topic in data mining. The existing TUFP algorithm supports static mining of Top-K UFPs; however, in the real world, users need to repeatedly change the K threshold to extract the information according to the requirements of their application. In interactive environments, the TUFP algorithm needs to re-scan the database and create the UP-Lists and CUP-Lists from scratch which is very time-consuming. In this paper, a fast method called ITUFP is proposed for interactive mining of Top-K UFPs. The proposed method uses a new data structure called IMCUP-List to store information of patterns efficiently. It creates the UP-Lists with a single database scan, extracts the patterns by generating IMCUP-Lists, and stores all the lists. When K changes, the proposed algorithm only updates the IMCUP-Lists without having to create the lists from scratch. Accordingly, ITUFP conforms to the "build once, mine many" principle, where the UP-Lists and IMCUP-Lists are created only once and used in mining with different K values. This is the first study on interactive mining of Top-K UFPs. Extensive experimental results with sparse and dense uncertain data prove that the proposed method is very efficient for interactive mining of Top-K UFPs.
收起
摘要 :
In recent years, frequent pattern mining from uncertain data has been actively researched in data mining. There are numerous exact and upper bound-based approaches for uncertain frequent pattern mining. Exact-based algorithms may ...
展开
In recent years, frequent pattern mining from uncertain data has been actively researched in data mining. There are numerous exact and upper bound-based approaches for uncertain frequent pattern mining. Exact-based algorithms may produce a large data structure and need time-consuming calculations and upper bound-based algorithms may produce many false positives. As a result, these algorithms demand much time and memory. There have been efforts to resolve the problem of upper bound-based algorithms, however, all of these methods only try to tighten the upper bound of expected support for long patterns. This is while pruning infrequent short patterns has a greater impact on reducing the false positives. To overcome these drawbacks, in this paper an efficient method based on upper bound is proposed for mining uncertain frequent patterns. The proposed method uses a new Tightened upper bound to expected support of patterns (Tup) which has a significant effect on reducing the number of false positives by tightening the upper bound of expected support and early pruning of infrequent 2-itemsets and their supersets. Comprehensive experimental results show that the proposed method reduces memory consumption in most cases and dramatically improves the performance of exact and upper bound-based methods in terms of runtime and scalability for dense and sparse uncertain data.
收起
摘要 :
Frequent-pattern mining has been studied extensively and has many useful applications. However, frequent-pattern mining often generates too many patterns to be truly efficient or effective. In many applications, it is sufficient t...
展开
Frequent-pattern mining has been studied extensively and has many useful applications. However, frequent-pattern mining often generates too many patterns to be truly efficient or effective. In many applications, it is sufficient to generate and examine frequent patterns with a sufficiently good approximation of the support frequency instead of in full precision. Such a compact but “close-enough” frequent-pattern base is called a condensed frequent-pattern base.
收起
摘要 :
Frequent-pattern mining has been studied extensively and has many useful applications. However, frequent-pattern mining often generates too many patterns to be truly efficient or effective. In many applications, it is sufficient t...
展开
Frequent-pattern mining has been studied extensively and has many useful applications. However, frequent-pattern mining often generates too many patterns to be truly efficient or effective. In many applications, it is sufficient to generate and examine frequent patterns with a sufficiently good approximation of the support frequency instead of in full precision. Such a compact but close-enough frequent-pattern base is called a condensed frequent-pattern base. In this paper, we propose and examine several alternatives for the design, representation, and implementation of such condensed frequent-pattern bases. Several algorithms for computing such pattern bases are proposed. Their effectiveness at pattern compression and methods for efficiently computing them are investigated. A systematic performance study is conducted on different kinds of databases, and demonstrates the effectiveness and efficiency of our approach in handling frequent-pattern mining in large databases.
收起
摘要 :
The regular frequent pattern mining (RFPM) approaches are aimed to discover the itemsets with significant frequency and regular occurrence behavior in a dataset. However, these approaches mainly suffer from the following two issue...
展开
The regular frequent pattern mining (RFPM) approaches are aimed to discover the itemsets with significant frequency and regular occurrence behavior in a dataset. However, these approaches mainly suffer from the following two issues: (1) setting the frequency threshold parameter for the discovery of regular frequent patterns technique is not an easy task because of its dependency on the characteristics of a dataset, and (2) RFPM approaches are designed to mine patterns from the static datasets and are not able to mine dynamic datasets. This paper aims to solve these two issues by proposing a novel top-K identical frequent regular patterns mining (TKIFRPM) approach to function on online datasets. The TKIFRPM maintains a novel synopsis data structure with item support index tables (ISI-tables) to keep summarized information about online committed transactions and dataset updates. The mining operation can discover top-K regular frequent patterns from online data stored in the ISI-tables. The TKIFRPM explores the search space in recursive depth-first order and applies a novel progressive node’s sub-tree pruning strategy to rapidly eliminate a complete infrequent sub-tree from the search space. The TKIFRPM is compared with the MTKPP approach, and it found that it outperforms its counterpart in terms of runtime and memory usage to produce designated topmost-K frequent regular pattern mining on the datasets following incremental updates.
收起
摘要 :
A major challenge in frequent-pattern mining is the sheer size of its mining results. To compress the frequent patterns, we propose to cluster frequent patterns with a tightness measure δ (called δ-cluster), and select a represe...
展开
A major challenge in frequent-pattern mining is the sheer size of its mining results. To compress the frequent patterns, we propose to cluster frequent patterns with a tightness measure δ (called δ-cluster), and select a representative pattern for each cluster. The problem of finding a minimum set of representative patterns is shown NP-Hard. We develop two greedy methods, RPglobal and RPlocal. The former has the guaranteed compression bound but higher computational complexity. The latter sacrifices the theoretical bounds but is far more efficient. Our performance study shows that the compression quality using RPlocal is very close to RPglobal, and both can reduce the number of closed frequent patterns by almost two orders of magnitude. Furthermore, RPlocal mines even faster than FPClose [G. Grahne, J. Zhu, Efficiently using prefix-trees in mining frequent itemsets, in: Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03)], a very fast closed frequent-pattern mining method. We also show that RPglobal and RPlocal can be combined together to balance the quality and efficiency.
收起